(Be sure to start this notebook with the command "ipython notebook --pylab inline".)
Section 1.1 of the NLTK book describes some pre-loaded books and the pre-defined functions that come with them. Section 1.2 reviews fundamental concepts about Python lists and strings; if you need to brush up on these concepts, study this subsection carefully. Be sure you know the difference between a set and a list and that you can work easily with Python slices.
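As a quick refresher on those two points, here is a minimal sketch using a made-up token list (the `tokens` name and its contents are illustrative, not taken from the NLTK texts):

```python
# Toy token list (illustrative, not taken from the NLTK texts).
tokens = ["the", "knights", "who", "say", "ni", "the", "knights"]

# A list keeps order and duplicates; a set keeps only the distinct items.
vocabulary = set(tokens)
print(len(tokens), len(vocabulary))

# Slices select a range of positions: the start is included, the stop is excluded.
first_three = tokens[:3]    # positions 0, 1, 2
last_two = tokens[-2:]      # the final two items
every_other = tokens[::2]   # positions 0, 2, 4, 6
print(first_three, last_two, every_other)
```

The same slice syntax works on the NLTK text objects, since they behave like lists of strings.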
The part I most want you to focus on is Section 1.3, which introduces NLTK's frequency distribution data structure. The books from Section 1.1 must be loaded and accessible for this part to work.
This data structure makes it easy to tally up frequencies across words and other items, and incorporate them into list comprehensions (and later we'll see the conditional frequency distribution as well).
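To make the idea concrete without loading the books, here is a small sketch that uses `collections.Counter` as a stand-in for `FreqDist` (an assumption made only so the example runs without NLTK; in NLTK 3, `FreqDist` is a subclass of `Counter`, so the same operations should carry over). The token list is invented.

```python
from collections import Counter

# Toy token list (illustrative only).
tokens = ["ni", "the", "ni", "shrubbery", "the", "ni", "!"]

# Counter stands in for nltk.FreqDist here; both map each item to its count.
freq = Counter(tokens)
print(freq["ni"])  # how many times "ni" occurs

# Counts can be used as a filter inside a list comprehension.
repeated = sorted(w for w in freq if freq[w] >= 2)
print(repeated)
```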
The code below counts every word in Monty Python and the Holy Grail (text6 in the nltk.book collection); the final line shows the 50 most frequent.
In [1]:
import nltk
from nltk.book import * # loads in pre-defined texts
mp_freqdist = FreqDist(text6) # compute the frequency distribution
mp_freqdist.items()[:50] # show the top 50 (word, frequency) pairs
Out[1]:
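A portability note: slicing `items()` relies on the older NLTK behavior, where `items()` returns a list sorted by decreasing frequency. In NLTK 3 on Python 3, `FreqDist` is a subclass of `collections.Counter`, so `items()` is a dictionary view that cannot be sliced and is not sorted by frequency. A sketch of the equivalent top-N lookup, again with `Counter` standing in for `FreqDist` and an invented token list:

```python
from collections import Counter

tokens = ["ni", "the", "ni", "shrubbery", "the", "ni"]
freq = Counter(tokens)  # stand-in for FreqDist(text6)

# Portable top-N: ask for the N most common (item, count) pairs directly.
top_two = freq.most_common(2)

# Equivalent by hand: sort the pairs by descending count, then slice.
top_two_sorted = sorted(freq.items(), key=lambda pair: pair[1], reverse=True)[:2]

print(top_two)
```

On a newer installation, `mp_freqdist.most_common(50)` should therefore play the role of `mp_freqdist.items()[:50]`.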
Task 1 Wow, those are some weird results. It might make some sense to look at the actual text itself. In the line below, write a line of code that pulls out the first 500 words of the text and shows them to you (hint: the text object is simply a list of strings).
In [2]:
" ".join(text6[:500])
Out[2]:
Task 2 Now that you've looked at the text, what are two reasons for these strange results?
Task 3 Address one of the problems by modifying the text of Monty Python and rerunning the frequency distribution calculation. In the box below write your code to modify the text:
In [3]:
# Create a new text that removes the ALL-CAPS names.
# The length check keeps single-character punctuation from being removed, since ".".upper() == ".".
new_text = [t for t in text6 if len(t) == 1 or t != t.upper()]
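To see exactly what this filter keeps and drops, here is the same comprehension applied to a short invented token list. Note two side effects: single uppercase letters such as "I" survive because of the length check, while multi-character punctuation such as "..." is dropped because it equals its own uppercase form. The second comprehension is a hypothetical variant based on `str.isupper()`, shown only for comparison, not as the intended answer.

```python
# Invented tokens in the style of a screenplay (not copied from text6).
tokens = ["SCENE", "1", ":", "ARTHUR", ":", "Whoa", "there", "!",
          "I", "am", "Arthur", "..."]

# Same filter as above: keep one-character tokens and anything not all caps.
kept = [t for t in tokens if len(t) == 1 or t != t.upper()]
print(kept)

# Hypothetical variant: isupper() is False for tokens with no letters,
# so "..." is kept while multi-letter ALL-CAPS words are still removed.
kept_variant = [t for t in tokens if not (len(t) > 1 and t.isupper())]
print(kept_variant)
```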
Task 4 In the box below, show the output after applying this version of the text to a FreqDist.
In [4]:
new_freq = FreqDist(new_text)
new_freq.items()[:50]
Out[4]:
Task 5 How, if at all, has the output changed? Answer: The upper-case names have been removed. This has made more room for very short words in the top 50.
Task 6 Following the example from the book, show a cumulative frequency plot for the words in Monty Python as newly computed, in the box below.
In [5]:
new_freq.plot(50, cumulative=True)
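The cumulative plot adds up the counts of the words in order of decreasing frequency. The numbers behind such a curve can be sketched by hand as follows, with `collections.Counter` standing in for `FreqDist` and an invented token list (both are assumptions for illustration):

```python
from collections import Counter
from itertools import accumulate

tokens = ["ni", "the", "ni", "shrubbery", "the", "ni", "!", "ni"]
freq = Counter(tokens)  # stand-in for the FreqDist

# Rank the items from most to least frequent, then keep a running total.
ranked = freq.most_common()
running = list(accumulate(count for _, count in ranked))
total = sum(freq.values())

for (word, _), cum in zip(ranked, running):
    # cumulative count and the share of all tokens covered so far
    print(word, cum, round(100 * cum / total))
```

The last running total always equals the number of tokens, which is why the curve flattens as rarer words are added.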
Task 7 In the box below, write a list comprehension that uses the FreqDist you computed above to find all words in Monty Python that are longer than 5 characters and occur at least 5 times (hint: the text shows how to do a variation of this).
Show the output sorted in alphabetical order.
In [6]:
# Keep words longer than 5 characters that occur at least 5 times, sorted alphabetically.
long_words = sorted(w for w, count in new_freq.items() if len(w) > 5 and count >= 5)
long_words
Out[6]: